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The paper uses mesoscopic, non-linear lattice dynamics based (Peyrard-Bishop-Dauxois, PBD) 
modeling to describe thermal properties of DNA below and near the denaturation temperature. 
Computationally efficient notation is introduced for the relevant statistical mechanics. Computed 
melting profiles of long and short heterogeneous sequences are presented, using a recently introduced 
reparametrization of the PBD model, and critically discussed. The statistics of extended open 
bubbles and bound clusters is formulated and results are presented for selected examples. 
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1. Introduction 

Thermal denaturation, i.e. separation of the two strands as a result of heating, is one of the 
oldest established physico-chemical results pertaining to the DNA molecule. The "melting" - 
of DNA was first observed [T] very soon after the determination of the double helical struc- 
ture. Theoretical descriptions in the sixties, proposed by Poland and Scheraga (PS) OE]' 
largely based on the concept of the helix-coil transition |4j , were subsequently refined and 
developed in considerable detail, incorporating known enthalpic data on controlled oligomer 
thermodynamics and, when combined with appropriate software, can claim significant pre- 
dictive power in regard to actual experimental melting profiles [36]. Helix-coil models adopt 
a mesoscopic description of DNA, describing the open (coil) or closed (helix) state of an indi- 
vidual base pair in terms of a discrete, Ising-type variable. By virtue of their construction 
they cannot describe any dynamic phenomena. 

An interesting alternative was proposed two decades ago by Peyrard, Bishop and Daux- 
ois (PBD) [71[H]. In accordance with contemporary soft mode concepts related to structural 
phase transitions, they attempted a reduced, also mesoscopic, lattice-dynamics motivated 
description of the double helix. In their case, the relevant degree of freedom is a transverse 
displacement which represents how far an individual base pair is from equilibrium. This is 
locally determined by a nonlinear (typically Morse-like) potential which accounts for the 
effective hydrogen bonding linking bases together. The motion of neighboring base pairs 
is coupled in a way which favors double-helical ordering (DNA stacking interaction). As it 
turns out, this minimal nonlinear lattice-dynamics (also proposed in the context of wetting 
phenomena [9]) generates one of the simplest known one-dimensional models with short- 
range interactions which exhibit an exact phase transition in the thermodynamic limit. 
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The PBD model of DNA does more than a "demonstration of principle" regarding the 
denaturation transition. As shown by Cule and Hwa [10] the randomness introduced by base- 
pair heterogeneity results in a structured rounding of the transition (multistep melting) in 
qualitative accordance with experimental observation. Recently, using improved computa- 
tional methods and a new, global set of model parameters, it has been possible to compute 
detailed, multipeak melting profiles of long DNA chains using only sequence information and 
salt concentration as input; although further parameter optimization is probably needed, 
agreement with experiment is impressive [11]. The model has also been used to calculate 
neutron scattering structure factors, also in very good agreement with recent experimental 
results [12]. Furthermore, as will be noted in this work, the low- frequency optical phonons of 
the model are in the range where Raman spectroscopy has located some vibrational activity 
in DNA [13] . 

Important issues remain open. One of the key advantages of the PBD approach is its 
potential ability to describe local openings ("denaturation bubbles") of the double helix, 
such as those known to occur during the initial stage of the transription process. The rela- 
tively slow dynamics involved in these events would be consistent with the coarse graining 
implicit in the model. Bubble statistics has been extensively studied in the framework of the 
PBD model, and some specific correlation with transcription initiation sites has in fact been 
claimed [14j and debated [151116] . On a more modest - and yet quite fundamental - level, 
the formation of a single bubble has been observed by studying the melting of specially 
designed oligomers [T7] . As the results reported here will show, although the process and its 
rough temperature dependence are correctly accounted for by the PBD model, important 
details, notably the asymmetric biphasic character of the profile, are missing. This may be 
due to the fact that the global parameter set used in PBD modeling is either (i) not yet 
fully optimized or (ii) lacking in detail, e.g. in not taking proper account of the variations 
in stacking interaction between different sets of neighboring base pairs. 

The paper is structured as follows: Section 2 describes the model and introduces sys- 
tem parameters to be used. Section 3 presents thermodynamics, including some necessary 
background material, and focuses on (i) issues related to finite chains and (ii) heterogeneity. 
Section 4 presents computed melting profiles for both long and short chains and compares 
theoretical vs. experimental results. Section 5 discusses the statistical properties of extended 
objects, i.e. bubbles and clusters consisting, respectively, of consecutive open and bound 
base pairs. Section 6 summarizes results and perspectives. 

2. Model, Notation, Parameters 

I will use the version of the PBD model proposed in The total potential energy of a 
chain of base pairs can be represented as a sum of local and nonlocal contributions 

N-l N 

Hp = ^Wiyj,y,+i) + Y,Vjiyj), (2.1) 
j=i j=i 

where the transverse coordinate yj represents the separation of the two bases at the jth 
site, the local term, an on-site Morse potential 



V,{y,) = D,{l-e-'^^y^f 
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describes the combined effects of hydrogen-bonding, stacking and solvent acting on the j'th 
base pair, and the anharmonic elastic term 

{y, - Vj+if (2.2) 

models the nonlinear base-stacking interaction. The parameter k describes the strength of 
the linear stacking interaction, i.e. the "residual stacking" which characterizes the disordered 
state. Most of the stacking energy which favors values yj ~ Uj+i ~ comes from the 
nonlinear term which is parametrized by its dimensionless strength p and its range 1/6. 
In view of the very large ratio of double-stranded and single-stranded DNA persistence 
lengths [18], a value of p = 50 appears reasonable. A stacking interaction range 1/b = 5A 
is consistent with the double helical geometry (diameter 21 A). The other parameters are 
the depth Dj and the range l/cxj of the Morse potential. This work follows the choice of 
parameters made in [11], i.e. k = 0.45 meV/A^, ace = 6.9A~^, oat = 4.2j4~^ and 

Dec = 0.1655 + 0.00615 log — ^ eV 

0.075 

Dat = 0.1255 + 0.00855 log — ^ eV (2.3) 

0.075 

where c is the molar salt concentration. It may be noted that although these particular 
parameters were derived by fitting melting profiles, the resulting optical frequencies (with 
an effective mass per site equal to 618 amu) are about 83 cm~^ for GT and 44 cm~^ for AT. 
These values compare favorably with the results of low-frequency Raman spectroscopy [13] 
which identify a broad band below 85 cm~^. 



W{yj,yj 



+1) 



l + pe 



-biVj+Vj+i) 



3. Thermodynamics 
3.1. Definitions 

The thermodynamics of the model (|2.ip is determined by the interplay between the elastic 
interaction and the Morse potential. At low temperatures, the displacements {yn} oscillate 
around the minima of the Morse potential, y„ = 0; phonon spectra have a gap. At high 
temperatures displacements are large, oscillations occur on the fiat top of the Morse poten- 
tial, phonons are essentially gapless. As a result, thermodynamic properties are those of 
the harmonic chain. The classical partition function develops an infrared divergence, due 
to contributions from long-wavelength phonons. Some consequences of this divergence are 
fairly trivial, e.g. the emergence of a divergent temperature-independent, entropic contribu- 
tion. Others, in particular when it comes to oligomers, are critical. In order to keep proper 
track of this divergence, I will introduce the size of the allowed phase space L explicitly. 
The classical partition function of an open-ended chain with N base pairs is thus defined 

by 

Zn{L)= dyi--- r dyNe-^""^ , (3.1) 

J—oo J—oo 

where f3 = l/^ksT), ks is the Boltzmann constant and T the system temperature. Note 
that, because of the repulsive core of the Morse potential, contributions to the partition 
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function for negative displacements decay quite rapidly. Thus, even though numerical com- 
putations always involve a lower cutoff Umin of the integrations, if the cutoff is appropriately 
chosen, the values of the integrals remain practically independent of the particular value of 
that cutoff. 

A quantity which is of direct experimental interest is the fraction of bound (or, respec- 
tively, open) pairs. The microscopic average which is necessary to compute this is the 
probability that the nth base pair is bound, i.e. 




where ?/c is a - somewhat arbitrary - crossover distance which distinguishes open from bound 
base pairs. The calculations presented in this paper use a value ?/c = 2 A. 
The fraction of open pairs is then given by 

1 ^ 

n=l 

3.2. Preliminaries. The homogeneous case 
3.2.1. The transfer integral equation 
In the homogeneous case 

Dj = DAT.OLj = a AT, Vj 

the partition function can be calculated in terms of the spectra of the transfer integral 
equation, 

J — oo 

where 

K{y,y') = (.-PW{y,y')^-f)[VAT{y)+VAT{y')y2 (^35^ 

will be used in this work as a reference kernel, and the eigenfunctions are normalized to 
unity. Fig. [1] shows the numerically calculated lowest eigenvalues for different values of L. 
The discretization of the integral equation was performed using Gauss-Legendre quadratures 
in the interval {ymin = — 1.5yl,L), where L varies between 100 and 400 A and the density 
of the mesh has been kept constant at 4 points per A. The figure provides a numerical 
illustration of the existence of a well defined limit L ^ 00 of the integral equation (|3.4|) . 
Note in particular the rapid (quadratic) convergence of the successive numerical estimates 
of the critical temperatures to the limiting value. 
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Fig. 1. The left panel shows the largest eigenvalue of (|3.4[) as a function of temperature for a variety of upper 
cutoffs L. The right panel illustrates the dependence of the critical temperature (estimated by the position 
of the maximum of the second temperature derivative of Aq) on the cutoff L. 



3.2.2. The partition function 

The partition function (|3.ip can be written in terms of the spectra of ()3.4p as 

Z;v(L) = J^A^i/,^ (3.6) 



where the sum runs over ah eigenstates, 

•J 

and the L-superscripts have been dropped from eigenvalues and eigenfunctions for the sake 
of notational clarity. It should be understood that any eigenfunction and/or eigenstate is 
the result of a numerical calculation performed with a cutoff and on a grid. 
The free energy per lattice site (base pair) is given by 



/3fN{L,T) = j^ln Zn{L) = (1 - l)lnAo + ^ In 



The above form is useful because it illustrates important limiting behavior. 



(3.7) 



3.2.3. The infinite chain 

In the thermodynamic limit, N — )• oo, terms of order in (|3.7|) vanish. This kills all 
infrared-divergent terms and leaves only the contribution from the highest eigenvalue; for 
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that contribution however, the Umit L — >■ oo exists (Fig. [T]). The singularity of the spec- 
trum of (j3.4|) where the bound state merges into the continuum generates an infrared-free 
singularity of the thermodynamic properties of the infinite chain. The question of the order 
of the transition has been extensively discussed in the literature. The dichotomy between 
apparent (first-order) and exact limiting (continuous) behavior has been presented in some 
detail in the case of the helicoidal version of the PBD model J9] . The distinction is generic 
and applies to the model used here as well. 



3.2.4. The finite chain 

Fig. [2] (left panel) shows numerical results for the free energy per site in the case = 20. 
The transition is much more rounded, as would be expected from finite-size corrections. 
Moreover, the infrared divergent terms are significant on a visible scale. The concomitant 
effect on the melting profile can be seen on the right panel. There is no sign of convergence 
of the melting curve; as the inset indicates, the sequence of effective melting temperatures 
cannot be used in a straightforward way to produce a limit. 
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Fig. 2. The left panel shows the free energy per site (|3.7p as a function of temperature for N=20 and a 
variety of upper cutoffs L. The right panel shows the melting fractions for N=20 and varying L; the inset 
illustrates the dependence of the critical temperature (estimated by the peak in the temperature derivative 
of the melting curve) on the cutoff L. 



3.2.5. The long chain 

The difficulties encountered in the previous section would encourage the following "fun- 
damentalist" argument regarding melting of any finite chains: since, for any finite A^, the 
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integral in (j3.2p remains finite in the limit L oo, whereas the partition function diverges, 
Pn should vanish at any nonzero temperature; it would follow that 6 = 1 for any T > 0. 
How watertight is this formally correct "proof? 

In order to make this quantitative, let me go back to the free energy function. Suppose 
I would like the finite-size, infrared divergent corrections (cf. Fig. [2]) to remain very small, 



where e is some small number. On the other hand, one must note that the L's used are not 
entirely arbitrary. L must be large enough to ensure proper convergence of the TI equation 
()3.4p . For the set of parameters used here, it seems that values between 300 and 400 A 
should be sufficient. One may then argue that if, in addition, (j3.8p holds, the thermodynamic 
properties of the finite chain with values of L in this order will be numerically stable, i.e. L- 
independent. What might happen at "astronomic" values of L can have no practical bearing 
to DNA melting. 

A typical choice of e = 0.01 and L = 400 A suggests that N > 600 is a long enough chain 
to produce such numerically stable melting profiles within the PBD model. Interestingly, this 
is in line with what model-independent estimates based on analyzing melting temperatures 
of natural DNA classify as the limit of a long chain. This may however be too conservative 
a choice when it comes to PBD model considered here. Fig. [3] shows that for = 100 the 
melting profiles are numerically stable in the sense described above and produce a linearly 
convergent sequence of melting temperatures. 

3.2.6. The double- stranded ensemble 

It is possible to define an infrared divergence-free thermodynamic average quantity by con- 
sidering the conditional probability of any base pair being open, subject to the condition that 
the strands are not totally separated pO]. The use of such a "double stranded ensemble" is 
standard procedure in analyzing DNA melting profiles of oligomers in terms of statistical 
(Poland-Scheraga type) models, because there the theory is entirely formulated in terms 
of this conditional probability (internal melting fraction). The probability of fulfilling this 
condition (external melting fraction), which is necessary in order to analyze experimental 
data, must then be obtained or estimated by other means. Within the PBD formalism 
the internal melting fraction can be calculated by first considering the (infrared-divergent) 
statistical weight corresponding to "total melting", i.e. |20j . 



e.g. 




(3.8) 




(3.9) 



The quantity 



Pext{L) = 1 - 9ext{L) = 1 



Zn{L) 



(3.10) 



expresses the probability that at least one base-pair is bound. The conditional probability 
p* that the nth base pair is bound, provided that the chain is not entirely open, substitutes 
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Fig. 3. The melting fractions for N=100 as a function of temperature and varying L; the inset illustrates the 
dependence of the critical temperature (estimated by the peak in the temperature derivative of the melting 
curve) on the cutoff L. 

Z]\f{L) — Z^{L) for Zn{L) in the denominator of (|3.2|) and should be free of infrared 
divergences. The same should hold for the internal melting fraction 



Fig. m illustrates the use of the double-stranded ensemble for oligomers. The dashed 
curves are repetitions of the melting profiles of Fig for N = 20, L = 100, 400. The dotted 
curves represent pext in the two cases. The solid curves, which coincide, are the internal 
melting fractions 9int- 

I will return to a constructive use of the double-stranded ensemble in Section 14.21 

3.3. Thermodynamics of the heterogeneous case 
3.3.1. From multiple integrations to matrix products 

In the absence of translational invariance the partition function (|3.ip must be calculated 
directly as an iV-dimensional integral; to do this in real-space, with a typical mesh of 1201 
points (which is what acceptable numerics demands, corresponding to an L = 300 A) is an 
almost impossible task for N > 100. This is not just a matter of computing time, but also 
of maintaining numerical accuracy. 





n=l 
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Fig. 4. The dashed curves represent melting fractions for a short (N=20), homogeneous AT chain, computed 
for L = 100, 400. The corresponding external melting fractions are represented by the dotted curves. The 
solid curves (on top of each other) are the internal melting fractions. 

A useful alternative is to exploit the eigenfunction expansion of the reference (AT) kernel 

K{yj,yj+i) = '^Au,(l}v,{yj)4>u,{yj+i) (3.12) 

in order to transform the A^-dimensional integration in real space to a matrix multiplication 
in a restricted eigenfunction space, e.g. by demanding that the v summation be restricted 
to those states which satisfy hy/KQ < 10^^. Using the auxiliary quantities 

4il. = ^M^ r dy0.,(y)<^.,(y)e-/^^^^(^) j=2,--- ,N-1, (3.13) 

where Al^ (y) = Vccd/)— ^AT(y) if 3 is a GC site and if j is an AT site, it is straightforward 
to rewrite the partition function as 

z^(L) = Ar^ 41)4?L4^l3---41-l_.4^L (3-14) 

U-i,V2---l'N-l 

or, in more compact matrix notation, 

Zm{L) = A^-i < A(i) 1b(2) . . . b(^-^) I A(^) > . (3.15) 




10 A'. Theodorakopoulos 



Computing the probability of the nth base pair being in the bound state involves trans- 
forming the numerator of ()3.2p in a similar fashion. I use script letters to define vector 
components similar to ()3.13p . with an upper limit equal to yc, i.e. 

An 



yc 



B^L = ' , / 0.,(T/)0.,(|/)e-/^^^^(^) (3.16) 



oo 



and similarly for A^^^ Then 

_ <^W|b(2)---b(^-^)|aW > 

~ < A(i)|B(2)...B(^-i)|AW > 

< A(i) |B(2) . . . . . . b(^-i) I A(^) > 

< A(i)|B(2) ...B(^-i)|AW > ri = 2,---,N-l 

_ < aW|b(2)---b(^-i)|^w > 

~ < A(i)|B(2) ...B(^-i)|AW > ' ^ ' 

3.3.2. Further improvements in computational efficiency 

Computing pn for all sites involves heavy duplication of matrix multiplications. It is com- 
putationally much more efficient to define, compute and store intermediate results in vector 
form. Moreover, it is necessary to renormalize vectors to unity at every stage of a matrix- 
vector multiplication. The latter procedure minimizes numerical error because it keeps com- 
puted scalar products in the order of unity. 

Let the unit vectors at the ends be defined via 



(i)>= 1 |AW> (3.18) 



<u«| = < AW|i^ (3.19) 

where the ^'s are the norms of the respective unnormalized vectors. I then store the result 
of each successive matrix-vector multiplication in vector form, as a unit vector and a norm, 
i.e. 

B(^-^-)|v(^') > = ^f+ilv^-^'+i) > (3.20) 
< u(^')|b(J'+i) = < u^-^'+^Vt+i i = 1,2, • • • , - 2. 
It then follows that (i) 

7 — \N-l,,L ,,L,R ,,R ^ „(«)u.(Af--n) ^ 91 ^ 

for any choice of n = 1, 2, • • • , — 1, and (ii) 

1 < u("~^)|S(")|v(^~") > 

Pn = —r 7-r. — 777 — ^ n = 2,---,A^ — 1 

/i^ <uW|v(^-")> 

1 <^W|v(^-i) > 
~ /if < u(i)|v(^-i) > 

^^-;i<u(^-i)|v(i)> • 
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Computing and storing the necessary intermediates for a melting profile of a chain with N 
base pairs in vector form is a process with 0{N) computational steps. 



4. Computed melting profiles 
4.1. Long chains 

The procedure described in the previous sections has been used [11] to compute melting 
profiles of large genomic sequences. As an example I show in Fig. [5] the melting profile of 
the T7 phage, a sequence of 39937 base pairs with a 48.4% GC content. The experimental 
data from [5T] is also included for comparison. Note that the depths of the Morse wells have 
been fitted to the data; in fact, this is one of the two sets of data which were used in [TI] 
to derive the parameters entering Eqs. ()2.3p . 

In order to relate melting profiles to sequence details, it is customary to draw a melting 
map, in which each site is characterized by its melting temperature, defined via pi{Tm{i) = 
1/2. The melting map for the T7 phage is shown in the right panel of Fig. [5l along with 
the moving 200-pt average of local GC-content. 




Fig. 5. Left panel: The differential melting curve of the T7 phage. The full line shows theoretical results 
based on the PBD model, the dashed curves experimental results redrawn from [21] • Right panel: the melting 
map (left y-scale), 200-site moving average of GC content (right y-scale). The inset shows a zoomed region 
of the melting map. 



Two comments are in order here. First the obvious. Genomic heterogeneity destroys the 
sharp first order transition characteristic of the homogeneous PBD model (and of actual 
melting profiles of long polynucleotide chains). Remnants of multistep melting can be seen 
in the inset of Fig. \5\ which suggests melting of 200-bp chunks. This is in general accord with 
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theoretical expectations on the effect of disorder on phase transitions (reduction of effective 
dimensionahty, hence enhancement of the role of fluctuations, rounding of an incipient 
transition). On the other hand, in an interesting analysis of the effects of heterogeneity 
within the PBD model context, it has been argued [TO] that multistep behavior should be 
self-averaging, albeit with a large crossover length. I have not detected any systematic signs 
of self-averaging, even in genomic sequences of much larger lengths than the one presented 
here. 

The second comment concerns the change in entropy. The left panel of Fig. [5] shows 
the dependence of the entropy derivative on temperature (right y-scale). The curve is 
almost perfectly superimposed on the melting profile. The total entropy change, AAks 
per base pair, is considerable smaller than the typical experimental values [B] of (24 
cal/mol/bp/K), but represents a considerable improvement from the Iks obtained using 
previous parametrizations [22j of the PBD model. 

4.2. Short chains 

An interesting example of a short chain is the LA%AS sequence, which exhibits biphasic 
melting behavior at c = 50mM salt concentration [17j. Fig. O shows the (L-dependent) 
melting curves in the left panel. The curves were obtained using Morse potential depths 
according to ()2.3p . The extracted sequence of melting temperatures (inset, left panel) is 
regular. Melting behavior is of the single-phase type. This disagrees with experimentally 
observed behavior, which is not surprising; it should be quite clear that physical separations 
of the two DNA strands of the order of 300 — 400 A should be totally irrelevant to the study 
a fluctuating 48-bp molecule (which has a total length of 170 A). 

The obvious alternative within the context of the PBD model is the double-stranded 
ensemble (cf. I3.2.6|) . Melting profiles are shown in the middle panel of Fig. El they are 
practically L-independent. In the same panel I have included an estimate (dotted curve) 
which mimics the experimentally observed Oext ; in terms of relative temperature position 
and width. Making use of this estimate it is possible to calculate the total melting profile, 
which does indeed exhibit biphasic behavior. The differential melting curve is shown in the 
right panel, along with the corresponding experimental results of Ref. [17j. 

In spite of this qualitative agreement, there are still considerable problems with the 
theoretical description of this exemplary case of biphasic melting in DNA oligomers. First, 
the predicted peaks in the melting profiles lie more than 10 degrees higher than the exper- 
imentally observed ones. Second, the predicted biphasic behavior is of the wrong type, i.e. 
melting accelerates at higher temperatures (rather than slowing down, as experimentally 
observed). This may well signal a fundamental difficulty of describing oligomer melting 
behavior in terms of the PBD model. 

5. Statistics of bubbles and clusters 
5.1. Bubbles 

"Denaturation bubbles" (here to be referred to in brief as bubbles) are extended regions 
of space where the displacement exceeds the critical value Uc- The probability of having a 
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Fig. 6. Left panel: The melting curve of the L48AS sequence for a variety of L values. The inset shows 
the estimated melting temperatures. Middle panel: Melting curves in the double-stranded ensemble for a 
variety of L values. In addition, the dotted curve describes approximately the external melting fraction 9ext- 
The total fraction of open base pairs is described by the thick solid curve. Right panel: the calculated total 
differential melting curve. Inset: experimental results, reproduced from Fig. 2 of ref. [T^ with permission. 



bubble of length k at site n is (cf. (j3.17p ) 

< A(1) |B(2) . . . B{n~k)^{n-k+l) . . . Q{n)-Q{n+1) . . . ^{N-1) |^(7V) > 

<A(i)|B(2)... 6(^^-1)1 AW > ' ^^'^^ 

where B^"''^ is the complement of B^'^\ i.e. 

and n = k + 1, ■ ■ ■ , N — 1, i.e. this is a backward-counted, "internal" bubble. In terms of 
the stored unit vectors and norms of section 13.31 this can be rewritten in the form 



Q{n\k) = -. ^— ' , ,^ ' . (5.3) 

Note that a strict definition of a bubble should include the demand that the enclosing sites 
n — k and n + 1 should be occupied by bound base pairs. The probability for this to occur 
will be 

Q{n\k) = Q{n\k) - Q(n + + 1) - Q{n\k + 1) + Q(n + + 2) 

For very long chains, it is possible to average over all sites (A^' < to exclude boundaries) 
and obtain the site-averaged probability 

n 
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Fig. 7. A schematic view of the bubble size probability distribution. Starting at the nth site (bound base 
pair, upward pointing arrow), all possible sequences to the left of n are shown. Either there is no bubble (no 
downward pointing arrows), or the bubble terminates after 1,2, - ■■ k sites. Since the list is exhaustive, the 
sum of all probabilities must equal the probability of the nth site being in the bound state. Similarly, if the 
first row is excluded, the sum of all remaining probabilities must equal the joint probability of the nth site 
being in the bound and the (n — l)st in the unbound state (flip density). 

for the occurrence of a (strictly) size k bubble somewhere along the chain. Fig. [7] provides 
a schematic illustration of the sum rule 

00 

Y.Qk = i-0 . (5.5) 

k=0 

In fact, as the figure suggests, the sum rule is valid also for the non-averaged probabilities, 
i.e. 

00 

^Qin\k)=pn Vn , (5.6) 

fe=0 

since the sequences shown exhaust all configurations in which the nth base pair is bound 
("up"). 

Another quantity of interest is the density of neighboring bound / unbound pairs (spin 
flips). Going back to Fig. [7] it is straightforward to identify the flip density - which, owing 
to the topology of the one-dimensional chain, is equal to the average bubble density - with 

00 

J2Qk = l-e-Qo, (5.7) 

k=l 

where Qq is the joint probability of two successive sites being in the bound state (the 
zero-size bubble). 
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The average size of a bubble can be defined as 

— -F=^o — 7^ • lO-Oj 

Z^fe=i Vfe 

Noting that the product of average bubble size and density is just the fraction of open 
base pairs leads to a second sum rule, 



f2kQk = . (5.9) 



fc=i 



The above sum rules ()5.5p and ()5.9p represent exact properties of the infinite chain. 
They have been explicitly verified in the homogeneous case [2^. Moreover, since ()5.5p can 
be derived by finite-averaging of (j5.6p it will hold exactly for finite, heterogeneous sequences. 
()5.9p is expected to hold for a long heterogeneous sequence, provided that average bubble 
number and size allow for meaningful statistics. 

As an example of (non-averaged) site bubble distribution in genomic DNA, Q(n|10) 
has been calculated in the case of the T7 phage and is shown in Fig. [51 Note that the 
probability of 10-site bubble occurrence can be substantial, i.e. only slightly smaller than 
the probability of single base-pair opening (estimated to be in the order of 1 ppm by imino 
proton exchange measurements [S]). Clearly, this can only happen in AT-rich regions. The 
figure includes known [25] promoter sites for (a very qualitative and at the present stage 
by no means systematic) comparison. 

5.2. Double- stranded clusters 

Equivalently, one may define the probability of a number of successive base pairs being in 
the bound state. Such a A:-size double-stranded cluster (in brief: cluster) ending on the nth 
site, and surrounded by open base pairs at the ends, would be associated with a probability 

V{n\k) = V{n\k) - V{n + + 1) - V{n\k + 1) + V{n + 1[A; + 2) 

where 

V{n\k) = - ' ■ (5.10) 

and all the arguments of the previous subsection related to bubbles apply mutatis mutandis 
to clusters. Again, for long genomic sequences, one may perform a site average and obtain 
the probability of occurrence of a A;-size cluster anywhere along the chain, 



N' 

n 

The above space-independent probabilities satisfy the sum rules 



oo 

Y^kPk = i 

k=l 

oo 



Y,Pk = . (5.12) 
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T7, T=310 K 

10-site bubble probability 

~ known promoter sites I 




10 15 20 25 30 35 40 

site/1000 



Fig. 8. Probability of occurrence of 10-site bubble in the T7-phage. The circles denote pubhshed promoter 
sites [25]. 

Note that the latter sum includes a quantity Pq which is the (unconditional) probability of 
two successive sites being in the open state (the zero-size cluster). 
The average size of a cluster is 

The quantity 9 — Pq expresses the probability of finding an open base pair, followed by a 
closed base pair. Similarly, the quantity 1 — 9 — Qq is equal to the probability of finding a 
closed base pair, followed by an open base pair. These "flip" probabilities must be equal: 

l-9-Qo = 9-Po ; (5.14) 

furthermore, the average length of a combined cluster and the bubble which follows it, is, 
from dSSl) and (f533]l . 

^'^ + ^b = j^^ , (5.15) 

i.e. 9 — Pq = 1 — 9 — Qq expresses the number density of bubbles and/or clusters. 

Site-averaged probabilities of cluster occurrence are important in understanding exper- 
iments performed with macroscopic samples of genomic DNA, e.g. neutron or X-ray scat- 
tering. Fig [9] shows typical behavior of P„ as a function of n for a variety of temperatures. 
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The sequence used was that of the Tl phage. Note the strictly exponential behavior. The 
slope increases as one approaches melting - and beyond -. Absolute probabilities increase 
slightly with increasing temperature in the ordered phase since, according to the definition 
of a strict cluster, it must be surrounded by open sites. Table 1 provides a quantitative test 
of the averaging process in terms of the sum rules ()5.12p . Summation beyond the explicitly 
calculated values {k = 112) has been extended to infinity using the numerically determined 
slopes. 



Table 1. Test of sum rules ((5TT2|) . 



T (K) 




1 - e 




e 


320 


0.9888 


0.9960 


0.0040 


0.0040 


330 


0.9889 


0.9946 


0.0054 


0.0054 


340 


0.9847 


0.9891 


0.0109 


0.0109 


350 


0.0216 


0.0200 


0.9800 


0.9800 




Fig. 9. Site-averaged probability of occurrence of a size-fc cluster in the T7-phage. 
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6. Conclusions and Outlook 

Mesoscopic lattice-dynamics modeling of DNA has been described in this paper as a pow- 
erful tool for the description of thermal properties. In particular, it has been shown that 
efficient computational techniques can be used to calculate melting profiles of very long 
genomic sequences. The resulting - not yet fully optimized - parametrization is in (full or 
at least partial) accord with results from other independent measurements, e.g. stiffness, 
Raman spectroscopy, calorimetry, neutron diffraction. On the basis of these findings, it 
appears that further work is needed in at least three directions. First, a full-scale optimiza- 
tion of model parameters should be undertaken, involving a significantly larger amount of 
melting data on long chains. Second, an effort should be made to include the computa- 
tionally intensive, physically important effect of heterogeneous stacking |26p27j on bubbles 
in long genomic chains . Third - not necessarily independent of the other two - would be 
to achieve an improved understanding of the collective (static and dynamic) properties of 
specially designed oligomers. 
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